Let’s load the Prosper data and look at the number of rows and columns.
Row count:
## [1] 113937
Column count:
## [1] 81
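A minimal sketch of how these two counts might be obtained in base R. The file name in the comment is an assumption (the text does not state it), and a tiny made-up CSV is written to a temporary file so the snippet runs on its own.

```r
# Stand-in for the real file so the example runs anywhere.
# With the actual data this would be something like
# read.csv("prosperLoanData.csv") (file name assumed, not given in the text).
path <- tempfile(fileext = ".csv")
toy <- data.frame(
  LoanOriginalAmount = c(9425, 10000, 3001),
  BorrowerRate       = c(0.158, 0.092, 0.275),
  Term               = c(36, 36, 60)
)
write.csv(toy, path, row.names = FALSE)

loans <- read.csv(path)
nrow(loans)  # row count
ncol(loans)  # column count
```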
There are 113937 listings in the dataset, with 81 variables. For the scope of this project, I am going to limit the number of variables. The question is which ones.
Looking at how Prosper works[1], I added variables that fit the following criteria:
## 'data.frame': 113937 obs. of 15 variables:
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years: int 0 1 0 0 0 0 0 1 0 0 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ DaysWithCreditLine : num 5126 7159 4837 11926 4264 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ ListingCategory : Factor w/ 21 levels "Not available",..: 1 3 1 17 3 2 2 3 8 8 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ AnnualIncome : num 37000 73500 25000 34500 115000 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ Term : Factor w/ 3 levels "12","36","60": 2 2 2 2 2 3 2 2 2 2 ...
## $ ProsperRating : Factor w/ 7 levels "AA","A","B","C",..: NA 2 NA 2 5 3 6 4 1 1 ...
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
Let’s take a look at the data summary:
## DelinquenciesLast7Years PublicRecordsLast10Years DebtToIncomeRatio
## Min. : 0.000 Min. : 0.0000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 0.140
## Median : 0.000 Median : 0.0000 Median : 0.220
## Mean : 4.155 Mean : 0.3126 Mean : 0.276
## 3rd Qu.: 3.000 3rd Qu.: 0.0000 3rd Qu.: 0.320
## Max. :99.000 Max. :38.0000 Max. :10.010
## NA's :990 NA's :697 NA's :8554
## BankcardUtilization RevolvingCreditBalance DaysWithCreditLine
## Min. :0.000 Min. : 0 Min. : 1036
## 1st Qu.:0.310 1st Qu.: 3121 1st Qu.: 5702
## Median :0.600 Median : 8549 Median : 7297
## Mean :0.561 Mean : 17599 Mean : 7646
## 3rd Qu.:0.840 3rd Qu.: 19521 3rd Qu.: 9276
## Max. :5.950 Max. :1435667 Max. :24898
## NA's :7604 NA's :7604 NA's :697
## InquiriesLast6Months LoanOriginalAmount ListingCategory
## Min. : 0.000 Min. : 1000 Debt consolidation:58308
## 1st Qu.: 0.000 1st Qu.: 4000 Not available :16965
## Median : 1.000 Median : 6500 Other :10494
## Mean : 1.435 Mean : 8337 Home improvement : 7433
## 3rd Qu.: 2.000 3rd Qu.:12000 Business : 7189
## Max. :105.000 Max. :35000 Auto : 2572
## NA's :697 (Other) :10976
## EmploymentStatus AnnualIncome BorrowerRate Term
## Employed :67322 Min. : 0 Min. :0.0000 12: 1614
## Full-time :26355 1st Qu.: 38404 1st Qu.:0.1340 36:87778
## Self-employed: 6134 Median : 56000 Median :0.1840 60:24545
## Not available: 5347 Mean : 67296 Mean :0.1928
## Other : 3806 3rd Qu.: 81900 3rd Qu.:0.2500
## : 2255 Max. :21000035 Max. :0.4975
## (Other) : 2718
## ProsperRating ListingCreationDate
## C :18345 2013-10-02 17:20:16.550000000: 6
## B :15581 2013-08-28 20:31:41.107000000: 4
## A :14551 2013-09-08 09:27:44.853000000: 4
## D :14274 2013-12-06 05:43:13.830000000: 4
## E : 9795 2013-12-06 11:44:58.283000000: 4
## (Other):12307 2013-08-21 07:25:22.360000000: 3
## NA's :29084 (Other) :113912
There are several sharp spikes in the loan amount, which is no surprise: people tend to borrow in round numbers. Interestingly, 4000 is the most common amount borrowed, followed by 10000 and 15000.
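One way to check which amounts are most common, sketched in base R. The vector below is made up for illustration; in the real data it would be the LoanOriginalAmount column.

```r
# Toy loan amounts (made up); the real column is LoanOriginalAmount.
amount <- c(4000, 4000, 4000, 10000, 10000, 15000, 3001, 9425)

# Frequency of each distinct amount, most common first.
counts <- sort(table(amount), decreasing = TRUE)
head(counts, 3)

# Share of loans whose amount is a multiple of 1000.
round_share <- mean(amount %% 1000 == 0)
round_share
```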
Most people borrow to consolidate their debts.
Most borrowers are employed.
At binwidth=1000, we can see sharp spikes around certain amounts, which makes sense, since users tend to enter round numbers. The histogram is positively skewed: most of the mass sits on the left, with a long tail to the right.
Most borrowers have no delinquencies in the last 7 years and no public records in the last 10 years. If I remove the borrowers with 0 delinquencies and 0 public records, I get:
While most borrowers have 0 delinquencies, there are still almost 4000 borrowers who have at least 1 delinquency in the last 7 years, and 2000 borrowers who have at least 1 public record in the last 10 years.
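Dropping the zero rows could look roughly like this in base R. The data frame is a made-up stand-in that only reuses the two column names from the dataset.

```r
# Toy data (made up); column names follow the dataset.
loans <- data.frame(
  DelinquenciesLast7Years  = c(4, 0, 0, 14, 0, NA),
  PublicRecordsLast10Years = c(0, 1, 0, 0, 0, NA)
)

# Keep only borrowers with at least one delinquency / public record.
# subset() also drops rows where the condition evaluates to NA.
delinq  <- subset(loans, DelinquenciesLast7Years > 0)
records <- subset(loans, PublicRecordsLast10Years > 0)

nrow(delinq)   # borrowers with >= 1 delinquency
nrow(records)  # borrowers with >= 1 public record
```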
Debt to Income Ratio
A debt-to-income ratio is the percentage of a consumer’s monthly gross income that goes toward paying debts. The data is capped at 10.01: any debt-to-income ratio larger than 1000% is returned as 1001%.
Removing the upper quantile of the data, we get:
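A sketch of counting the capped values and trimming the upper tail. The text does not say which quantile was used, so the 99th percentile here is purely an assumption, and the values are made up.

```r
# Toy DebtToIncomeRatio values (made up), including one capped value and one NA.
dti <- c(0.17, 0.18, 0.06, 0.15, 0.26, 0.36, 0.27, NA, 10.01, 0.25)

# Values at the cap: ratios above 1000% are all stored as 10.01.
n_capped <- sum(dti == 10.01, na.rm = TRUE)

# Drop everything above the 99th percentile (cutoff is an assumption).
cutoff  <- quantile(dti, 0.99, na.rm = TRUE)
trimmed <- dti[!is.na(dti) & dti <= cutoff]

n_capped
range(trimmed)
```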
Revolving Credit Balance
Revolving Credit Balance is the total outstanding balance that the borrower owes on open credit cards or other revolving credit accounts.
Bankcard Utilization
Bankcard utilization is the sum of the balances owed on open bankcards divided by the sum of the cards’ credit limits. Lower usually means better.
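As a small worked example of that definition, with hypothetical balances and limits for a single borrower:

```r
# Hypothetical open bankcards for one borrower (numbers are made up).
balance <- c(500, 1200, 0)
limit   <- c(2000, 3000, 1000)

# Sum of balances over sum of limits (not the mean of per-card ratios).
utilization <- sum(balance) / sum(limit)
utilization
```

A value above 1 would simply mean the total balance exceeds the total limit.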
Interestingly, there are 2 peaks in the plot: many borrowers have almost 0% bankcard utilization, and there is another peak near 100%. Some borrowers have utilization > 1.00 (100%).
Length of credit history is the number of days from the date when the oldest account on the borrower’s credit record was opened till today.
The oldest credit lines go back more than 60 years (the maximum of 24898 days is about 68 years).
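A sketch of how such a day count could be derived. The source column is assumed to be a date of the first recorded credit line (the text does not name it), the dates are made up, and a fixed reference date replaces "today" so the result is reproducible.

```r
# Dates of the oldest account per borrower (made up; source column assumed).
first_line <- as.Date(c("2001-10-11", "1996-03-18", "1949-01-01"))

# Fixed reference date for reproducibility; the text uses "today",
# which would be Sys.Date().
ref_date <- as.Date("2015-01-01")

days_with_credit_line  <- as.numeric(difftime(ref_date, first_line, units = "days"))
years_with_credit_line <- days_with_credit_line / 365.25

days_with_credit_line
round(years_with_credit_line, 1)
```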
Most loans have a 36-month term.
The median borrower rate is 18.4% and the mean is 19.28%. There are 6 observations with a borrower rate of more than 40%.
What is/are the main feature(s) of interest in your dataset?
The main features of the data are:
I chose these variables because they are visible in the UI[1].
What other features in the dataset do you think will help support your investigation into your feature(s) of interest?
I added ListingCreationDate, just to see if there is a “trend” in the behavior.
Did you create any new variables from existing variables in the dataset?
Yes, DaysWithCreditLine.
Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?
All of the money-related variables (LoanOriginalAmount, RevolvingCreditBalance and AnnualIncome) are positively skewed. I did not transform the data for the univariate analysis.
People who borrow > 25000 have an annual income of >= 100000. It looks like there is some kind of rule: if you borrow > 25000, the minimum annual income is 100000.
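A quick way to test that apparent rule is to look at the smallest income among the large loans. The rows below are made up; only the column names come from the dataset.

```r
# Toy data (made up); column names follow the dataset.
loans <- data.frame(
  LoanOriginalAmount = c(9425, 10000, 30000, 35000, 26000),
  AnnualIncome       = c(37000, 73500, 115000, 250000, 100000)
)

# Restrict to loans above 25000 and find the lowest income among them.
large <- subset(loans, LoanOriginalAmount > 25000)
min_income_large <- min(large$AnnualIncome, na.rm = TRUE)
min_income_large
```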
That is not too informative. Let’s try to break DebtToIncomeRatio into several bins.
A quick look at the newly created variable.
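Binning of this kind can be sketched with `cut()`. The bin edges and labels below are illustrative assumptions, not the ones used in the analysis, and the values are made up.

```r
# Toy DebtToIncomeRatio values (made up).
dti <- c(0.06, 0.17, 0.26, 0.36, 0.55, 10.01, NA)

# Bin edges and labels are illustrative, not those used in the analysis.
dti_bin <- cut(dti,
               breaks = c(0, 0.1, 0.2, 0.3, 0.4, Inf),
               labels = c("0-10%", "10-20%", "20-30%", "30-40%", ">40%"),
               include.lowest = TRUE)

# Count per bin, keeping missing values visible.
table(dti_bin, useNA = "ifany")
```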
Let’s have another look at the relationship between DebtToIncomeRatio with BorrowerRate.
We can see that the median BorrowerRate increases as DebtToIncomeRatio increases.
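The same comparison can be made numerically by taking the median rate within each bin. This is a sketch on made-up rows; the bin edges and the `DtiBin` column name are assumptions.

```r
# Toy data (made up); DtiBin is a hypothetical name for the binned variable.
loans <- data.frame(
  DebtToIncomeRatio = c(0.05, 0.08, 0.15, 0.18, 0.35, 0.38),
  BorrowerRate      = c(0.09, 0.11, 0.15, 0.19, 0.25, 0.31)
)
loans$DtiBin <- cut(loans$DebtToIncomeRatio,
                    breaks = c(0, 0.1, 0.2, 0.4),
                    include.lowest = TRUE)

# Median borrower rate within each debt-to-income bin.
med <- tapply(loans$BorrowerRate, loans$DtiBin, median)
med
```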
Let’s separate the borrower rate into bins as well.
Let’s take a look at the relation of Term with other variables.
Let’s take a look at the relationship between DelinquenciesLast7Years and PublicRecordsLast10Years.
Let’s check ProsperRating relationship with other variables.
Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?
I wanted to see how several features affect the borrower rate, term and Prosper rating. I put several variables into “bins”, as this makes them a bit easier to work with. By doing this for borrower rate, debt-to-income ratio and delinquencies, we can paint a clearer picture of the relationships between features.
We can see, for instance, that the borrower rate increases as the debt-to-income ratio increases. The term seems to be related to the original loan amount: the bigger the amount, the longer the term.
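The term observation can be checked with a per-term summary. A minimal sketch on made-up rows, using the three term levels that appear in the data:

```r
# Toy data (made up); Term levels match those in the dataset.
loans <- data.frame(
  Term = factor(c("12", "36", "36", "60", "60", "12"),
                levels = c("12", "36", "60")),
  LoanOriginalAmount = c(2000, 6000, 8000, 12000, 15000, 3000)
)

# Median loan amount for each term.
by_term <- aggregate(LoanOriginalAmount ~ Term, data = loans, FUN = median)
by_term
```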
Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?
No.
What was the strongest relationship you found?
If we plot ProsperRating against other features, the plots become much clearer. For instance, we can quickly see that the debt-to-income ratio for rating AA is lower than for other ratings.
The borrower rate shows an even clearer picture: the better your rating, the lower your borrower rate.
The feature distributions within a rating differ somewhat from year to year. For instance, in the borrower rate distribution we can see that the borrower rate for rating AA in 2013 and 2014 is almost always between 0-10%. For rating B, the borrower rate seems to be 20-30% in 2011 and 2012, but 10-20% in 2013 and 2014.
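A year-by-rating breakdown like this could be tabulated roughly as follows. The rows are made up, the 10-point rate bins mirror the ranges quoted in the text, and the `Year` and `RateBin` names are assumptions.

```r
# Toy data (made up); timestamps follow the format seen in ListingCreationDate.
loans <- data.frame(
  ListingCreationDate = c("2009-05-01 10:00:00.000000000",
                          "2009-07-12 08:30:00.000000000",
                          "2014-01-15 12:00:00.000000000",
                          "2014-02-20 09:15:00.000000000"),
  ProsperRating = c("AA", "E", "AA", "E"),
  BorrowerRate  = c(0.12, 0.35, 0.07, 0.27),
  stringsAsFactors = FALSE
)

# The timestamp starts with YYYY, so the first four characters give the year.
loans$Year <- as.integer(substr(as.character(loans$ListingCreationDate), 1, 4))

# Borrower rate in 10-point bins (Year and RateBin are hypothetical names).
loans$RateBin <- cut(loans$BorrowerRate,
                     breaks = c(0, 0.1, 0.2, 0.3, 0.4, 0.5),
                     labels = c("0-10%", "10-20%", "20-30%", "30-40%", "40-50%"),
                     include.lowest = TRUE)

# Counts by rating, year and rate bin, printed as a flat table.
tab <- with(loans, table(ProsperRating, Year, RateBin))
ftable(tab)
```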
This plot shows the effect of several factors on ProsperRating. For instance, a borrower with lower bankcard utilization will usually get a better rating. Likewise, the longer you have had a credit line (DaysWithCreditLine), the better.
This plot shows the borrower rate distribution by ProsperRating. A borrower rated AA will likely have a 0-10% borrower rate.
This plot is another look at plot 2, with the added dimension of listing creation date. It shows the trend of the borrower rate from 2009 to 2014, faceted by ProsperRating. We can see that a borrower rated AA in 2009 could get a 10-20% borrower rate, while in 2013 and 2014 a borrower rated AA would get a 0-10% borrower rate. Most borrowers rated E in 2009 got a 30-40% rate, but in 2014 they could actually get a 20-30% borrower rate.
The Prosper data has a lot of variables, so for the scope of this project I limited the number of variables to investigate. The first step was to select which variables to investigate. After much thought, I used the variables that a borrower can actually see on the loan listing page[1]. I did this because I assume these are the metrics that are important for a lender to look at before actually lending money, so it is a good start.
Initially I wanted to show the relationship between the variables and borrower rate, for instance debt-to-income ratio vs borrower rate, or bankcard utilization vs borrower rate. To ease the exploration, I put several variables into “bins”, which makes it easier to show the relationships between variables.
It is also much easier to show relationships based on ProsperRating than on borrower rate. For instance, if we facet debt-to-income ratio by ProsperRating, it is easier to see that the lower your debt-to-income ratio, the better your rating. We can then show that the better your rating, the better your borrower rate.
Even with this limited number of variables, there is a lot that we can investigate further.